Why do we use charts to tell stories?
Evidence-based visual perception theory
Advice on choosing charts
Advice on using colour in charts
Using this advice to tell stories with charts built with {ggplot2}
A picture is worth a thousand words
There is considerable experimental evidence for data visualisations improving:
Comprehension of data
Decision making accuracy and confidence
Evidence has been collected using eye-tracking, surveys and interviews.
For a good overview of the available research see Eberhard 2021 [1].
Some of these studies consider tables to be a type of data visualisation.
I agree with this! Tables are often awesome choices for presenting data - let’s talk more about this later today.
In 1973 Anscombe [2] published a paper designed to demonstrate…
Graphs are essential to good statistical analysis.
To do so he constructed four datasets that share nearly identical summary statistics.
However, once the datasets are visualised it is obvious that they are fundamentally different from one another.
These charts are now known as Anscombe’s quartet [2].
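R ships with these four datasets as the built-in `anscombe` data frame, so the claim is easy to check. This is a minimal sketch using only base R; the helper name `summarise_pair` is ours, purely for illustration.

```r
# Anscombe's quartet is built into R as `anscombe`,
# with columns x1..x4 and y1..y4.
summarise_pair <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), cor_xy = cor(x, y))
}

# One row per dataset: the summary statistics are near-identical
quartet_summary <- t(sapply(1:4, summarise_pair))
print(round(quartet_summary, 2))

# ...but the four scatter plots look nothing alike
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```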
The “Datasaurus Dozen” is a modern reimagining of the original quartet [3].
Datasaurus was originally created by Alberto Cairo [4].
… there’s now an R package for building your own metamers eliocamp.github.io/metamer/
There are several historical visualisations that have fundamentally changed social policy and behaviour.
This is a map from John Snow in 1855 [5] that ties a cholera outbreak to a specific water pump.
Combined with Snow’s statistical analyses this was a significant step towards the development and acceptance of germ theory.
At around the same time, Florence Nightingale [6] was creating charts to demonstrate the importance of basic sanitation in military hospitals.
This specific chart type is very dramatic and quite rarely used. It’s a polar area diagram, also known as a Nightingale rose diagram.
But it’s important to acknowledge that Nightingale used many different types of charts in her work.
Her charts and analyses were central to bringing basic sanitation standards to nursing and hospitals.
In 2006 Hans Rosling [7] gave an incredible TED talk where he introduced animated bubble charts as a tool to tell stories about global development.
These charts helped demonstrate the value of interactive and animated data visualisations - which is why Google bought the tool behind the charts!
A more recent example of a very powerful data visualisation is the spiralling global temperature GIF from 2016 by Ed Hawkins [8].
We can create animated GIFs with {ggplot2} via the {gganimate} package. In fact, Pat Schloss [9] has a YouTube video and GitHub repo recreating this chart with R.
There is a wealth of evidence-based research into how precisely or accurately charts are perceived by readers.
Our evidence comes from:
Eye tracking. We’re really good at measuring where the eye is looking, for how long and how intently.
Asking trial participants to estimate or compare values in charts.
There are open debates [1] on how our internal visual perception system works - what the brain is doing.
[1] - A good example is pie charts: we’re still not sure what our brains are doing, but thanks to Robert Kosara [10] we know they’re not measuring area.
Back in 1984 Cleveland & McGill [11] published their seminal paper on graphical perception theory, where they defined “elementary perceptual tasks”.
This study is the backbone of much of the research in this field.
Cleveland & McGill [11] designed many experiments where participants were asked to:
Identify the largest/smallest segment
Estimate what % the smaller segment was of the larger segment
The accuracy of subject estimates was then statistically analysed.
Images from Beecham et al. [13]
Images from Robert Kosara [14]
Image found on Twitter from @irg_bio [15] - code for chart available from GitHub [16].
To extract accurate values: the magnitude of chart elements.
To quantitatively compare values: the part-to-whole or relative magnitude of chart elements.
To find the largest/smallest value: the ranking of chart elements.
To find unusual values: the distribution, ranking or magnitude of chart elements.
You have a story you want to tell
There’s lots we can do to help guide the reader to understand your chart and follow the story you’re telling. We’ll cover some examples during this course.
The reader wants to see the data
Charts (and tables) are the best way to see the “big picture” of a dataset - a single value (e.g. the mean) is kind of useless on its own. Interactivity is really useful to allow readers to properly explore the dataset.
The reader has a preconception about the data
Readers might be approaching a chart biased with a particular theory about the data. We can do our best to make our charts easy to read and avoid common pitfalls.
This site also provides simple to follow instructions for using {ggplot2} to build every single chart type you can find on the website.
The Visual Vocabulary is a really useful tool for thinking about how to tell your story with a chart.
Lots of the dataviz at the FT is done with R. John Burn-Murdoch [17] is a great source to follow.
Create a new project called something like week-4_dataviz.Rproj
Add a new RMarkdown document called ggplot2-notes.Rmd
We’re going to do some structured and unstructured coding today. During the workshop I’ll be asking you to create your own charts.
{ggplot2} is an incredibly powerful and flexible tool for building static dataviz.
We can build (almost [1]) any static chart we can conceive of.
[1] - Dual y-axis charts must be transformations of one another (for good reasons)
Aesthetics
Geoms
Scales
Guides
Theme
| Where is aes() placed? | What it does |
|---|---|
| Inside ggplot() or on its own | Sets the aesthetics for the entire {ggplot2} object. These could be considered the coordinate system aes() |
| Inside geom_*() | Sets aesthetics for a specific geom within the existing coordinate system aes() for the {ggplot2} object. These should be considered geom specific aes() |
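To make the two placements concrete, here is a small sketch using the `mpg` dataset that ships with {ggplot2}. The choice of dataset and columns is ours and is not taken from the original slides.

```r
library(ggplot2)

# aes() inside ggplot(): shared by every geom in the chart
p_global <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# aes() on its own: equivalent to placing it inside ggplot()
p_standalone <- ggplot(mpg) +
  aes(x = displ, y = hwy) +
  geom_point()

# aes() inside a geom: colour applies to the points only,
# so geom_smooth() still draws a single line for all cars
p_geom <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(method = "lm", formula = y ~ x)
```

Printing `p_geom` shows three point colours but one trend line; moving `colour = drv` into `ggplot()` would instead give one trend line per drive type.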
Geoms use the aesthetics to add layers to our charts.
There are 50+ geoms baked into the {ggplot2} package.
geom_abline(), geom_area(), geom_bar(), geom_bin2d(), geom_blank(), geom_boxplot(), geom_col(), geom_contour(), geom_contour_filled(), geom_count(), geom_crossbar(), geom_curve(), geom_density(), geom_density_2d(), geom_density_2d_filled(), geom_density2d(), geom_density2d_filled(), geom_dotplot(), geom_errorbar(), geom_errorbarh(), geom_freqpoly(), geom_function(), geom_hex(), geom_histogram(), geom_hline(), geom_jitter(), geom_label(), geom_line(), geom_linerange(), geom_map(), geom_path(), geom_point(), geom_pointrange(), geom_polygon(), geom_qq(), geom_qq_line(), geom_quantile(), geom_raster(), geom_rect(), geom_ribbon(), geom_rug(), geom_segment(), geom_sf(), geom_sf_label(), geom_sf_text(), geom_smooth(), geom_spoke(), geom_step(), geom_text(), geom_tile(), geom_violin(), geom_vline()
As we’ll see later, there are many {ggplot2} extension packages that add even more geoms to the mix.
But geom_bar() itself is built from geom_rect().
There are 8 primitives from which all other geoms are built:
geom_blank(), geom_path(), geom_point(), geom_polygon(), geom_rect(), geom_ribbon(), geom_segment(), geom_text()
x and y aesthetics
These tell the geom where it needs to be drawn:
x and y
Let’s use geom_segment() to visualise some of the eras of the dinosaurs:
To build this chart we need to specify all of the following: x, xend, y and yend.
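A minimal sketch of such a chart follows. The data frame `dino_eras` is illustrative and not from the original slides; the dates are approximate, in millions of years ago.

```r
library(ggplot2)

# Illustrative data (an assumption, not the workshop's dataset):
# approximate start and end of each period, in millions of years ago
dino_eras <- data.frame(
  era   = c("Triassic", "Jurassic", "Cretaceous"),
  start = c(252, 201, 145),
  end   = c(201, 145, 66)
)

# geom_segment() needs all four of x, xend, y and yend
p_eras <- ggplot(dino_eras) +
  geom_segment(aes(x = start, xend = end, y = era, yend = era)) +
  scale_x_reverse() +   # oldest dates on the left
  labs(x = "Millions of years ago", y = NULL)
```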
size to affect geom size
In many charts we want geoms to be thicker, bigger or just more prominent.
Timelines (or Gantt charts) are good examples of this. We want the segments to be thicker to improve the readability of the chart - this comes down to the size aesthetic.
This is still a bad chart.
The eras are not ordered in geological time, instead they’re ordered (reverse) alphabetically.
To control the order of things in {ggplot2} charts we must use factors - which are picked up by the scales.
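Here is a hedged sketch of both fixes together: thicker segments and an explicit factor order. The `dino_eras` data is illustrative, not the workshop's. The code assumes {ggplot2} 3.4.0 or later, where the thickness of lines is set with `linewidth`; in older versions the same effect comes from `size`.

```r
library(ggplot2)

# Illustrative data (an assumption): approximate dates in millions of years ago
dino_eras <- data.frame(
  era   = c("Triassic", "Jurassic", "Cretaceous"),
  start = c(252, 201, 145),
  end   = c(201, 145, 66)
)

# Set the level order explicitly. The first level is drawn at the
# bottom of the y axis, so the oldest period sits at the bottom.
dino_eras$era <- factor(dino_eras$era,
                        levels = c("Triassic", "Jurassic", "Cretaceous"))

p_timeline <- ggplot(dino_eras) +
  geom_segment(aes(x = start, xend = end, y = era, yend = era),
               linewidth = 10) +   # assumption: ggplot2 >= 3.4.0
  scale_x_reverse() +
  labs(x = "Millions of years ago", y = NULL)
```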
stat functions
The geom_bar() function has a stat argument with the default value of "count".
We can force the geom to behave like geom_col() by changing the stat:
All of the goodness from the stat argument comes from the stat_identity() and stat_count() functions.
If you’re building a complex chart it might be useful to directly call a stat_() function.
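The relationship between geom_bar(), geom_col() and the stat functions can be sketched as follows. The use of `msleep` here is our choice for illustration.

```r
library(ggplot2)
library(dplyr)

# Default: geom_bar() uses stat = "count" and tallies the rows itself
p_count <- ggplot(msleep, aes(y = vore)) +
  geom_bar()

# With pre-counted data, stat = "identity" makes geom_bar()
# behave like geom_col()
vore_counts <- count(msleep, vore)

p_identity <- ggplot(vore_counts, aes(x = n, y = vore)) +
  geom_bar(stat = "identity")

p_col <- ggplot(vore_counts, aes(x = n, y = vore)) +
  geom_col()

# Calling the stat directly draws the same bars as p_count
p_stat <- ggplot(msleep, aes(y = vore)) +
  stat_count()
```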
Box and whisker diagrams hide a lot of detail
Let’s add the data points to this chart with geom_point() and look at the position argument.
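One way this might look, using the `mpg` dataset as a stand-in (an assumption, the slides do not say which data was used). Jittering is only one of several position adjustments that could be chosen here.

```r
library(ggplot2)

p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  # Hide the boxplot's own outlier marks because every point is drawn below
  geom_boxplot(outlier.shape = NA) +
  # position_jitter() spreads points horizontally so they do not overlap;
  # height = 0 leaves the y values untouched
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1),
             alpha = 0.4)
```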
The position argument can also be used to create three different types of bar chart:
“stack” creates a stacked bar chart
“fill” creates a proportional bar chart
“dodge” creates a grouped bar chart
Let’s create all 3 of these for the following dataset:
# A tibble: 78 × 3
relig marital n
<fct> <fct> <int>
1 No answer No answer 4
2 No answer Never married 22
3 No answer Separated 3
4 No answer Divorced 13
5 No answer Widowed 7
6 No answer Married 44
7 Don't know Never married 6
8 Don't know Separated 3
9 Don't know Divorced 1
10 Don't know Married 5
# … with 68 more rows
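The three bar chart types can be sketched like this. We assume the tibble above was produced by counting `relig` and `marital` in the `gss_cat` dataset from {forcats}; the object names are ours.

```r
library(ggplot2)
library(dplyr)
library(forcats)  # provides the gss_cat dataset

# Assumption: this reproduces the tibble shown above
relig_marital <- count(gss_cat, relig, marital)

base_chart <- ggplot(relig_marital, aes(x = n, y = relig, fill = marital))

p_stack <- base_chart + geom_col(position = "stack")  # stacked (the default)
p_fill  <- base_chart + geom_col(position = "fill")   # proportional: each bar spans 0 to 1
p_dodge <- base_chart + geom_col(position = "dodge")  # grouped: bars side by side
```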
The geom_smooth() line is hiding data points.
We could either swap the order of these geoms or change the alpha aesthetic.
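Both options are sketched here with the `mpg` dataset, which is our stand-in for the chart on the slide.

```r
library(ggplot2)

# Option 1: draw the smooth first so the points sit on top of it
p_order <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = "loess", formula = y ~ x) +
  geom_point()

# Option 2: keep the order but add transparency.
# alpha on geom_point() fades the points; alpha on geom_smooth()
# fades the confidence ribbon around the line.
p_alpha <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x, alpha = 0.2)
```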
Scales determine the appearance of an aesthetic within the chart, including:
Axes labels and breaks
Colours used for colour and fill aesthetics
Scales also determine the order in which elements are shown in a chart.
To change the order of discrete/categorical columns we need to use factors.
{ggplot2} uses the {scales} package under the hood to build all of the scales that we see - including continuous and discrete scales.
The {scales} package also contains many utility functions that are useful for us to format our axes and other scales.
We can either load the {scales} package itself or call functions specifically with scales::label_percent()
Until recently the way we’d use {scales} would be as follows
There was a function called percent(x) for formatting a vector of values x and percent_format() for modifying the appearance of percentages in a {ggplot2} chart.
These functions have now been deprecated. This means there are new alternatives to these functions.
Deprecation is a fact of life in software development. But the details of how things are deprecated are variable.
Sometimes things are deprecated with the intention of removing them in the future. Other times, the deprecated functions will continue to exist far into the future.
It seems like the intention is for these functions to continue to work into the future. But they might be removed in several years’ time.
Let’s use the new approach for formatting scales so that you can read modern documentation and so you’re not learning deprecated functions.
We now use label_percent() for both types of operation.
This is known as a function factory.
Function factories are cool. But I wish you didn’t have to learn this syntax.
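A short sketch of what "function factory" means in practice. The `accuracy = 1` argument (round to whole percentages) and the chart itself are our choices for illustration.

```r
library(ggplot2)
library(scales)

# label_percent() returns a *function*...
percent_labeller <- label_percent(accuracy = 1)

# ...which we then call on a vector of proportions
percent_labeller(c(0.1, 0.256))

# In a chart we hand the generated function to the scale,
# which calls it on the axis breaks for us
p_percent <- ggplot(mpg, aes(x = class)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(labels = label_percent(accuracy = 1)) +
  labs(y = "Share of cars")
```

This is why the syntax looks like `label_percent()(x)` when formatting a vector directly: the first pair of brackets builds the formatter, the second pair uses it.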
There are many built-in colour palettes in {scales} - let me introduce two families of palettes.
The website colorbrewer2.org contains several palettes differentiated into sequential, diverging and qualitative.
There are some pretty good palettes for discrete/categorical variables in this family of palettes.
But for continuous variables I strongly recommend using the viridis family of palettes.
These are designed to be both perceptually uniform and to work for folks with colour blindness.
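Both families are available directly through {ggplot2} scale functions. A minimal sketch follows; the datasets and the "Set2" palette are our choices for illustration.

```r
library(ggplot2)

# A ColorBrewer qualitative palette for a discrete variable
p_brewer <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set2")

# A viridis palette for a continuous variable
p_viridis <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster() +
  scale_fill_viridis_c()
```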
One of the first frustrations people find with {ggplot2} is setting their own custom colours, e.g. in this chart:
We need to use scale_fill_manual()
We’ll come back to this chart in the section on guides().
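A sketch of how scale_fill_manual() might be used for a chart like this one, based on the `msleep` dataset. The named-vector form shown here is one of several ways to supply colours.

```r
library(ggplot2)
library(dplyr)

p_manual <- msleep %>%
  count(vore) %>%
  ggplot(aes(x = n, y = vore,
             fill = ifelse(vore == "herbi", "No meat", "Some meat"))) +
  geom_col() +
  # Named vector: each fill value is matched to a colour by name,
  # so the order of the entries does not matter
  scale_fill_manual(values = c("Some meat" = "red",
                               "No meat" = "darkgreen"),
                    name = NULL)
```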
Factors are R’s categorical data type. They allow us to create a variable with fixed values (levels) and to set the order of those levels.
Let’s look at a pre-existing dataset with factors:
[1] $8000 to 9999 $8000 to 9999 Not applicable Not applicable Not applicable
[6] $20000 - 24999
16 Levels: No answer Don't know Refused $25000 or more ... Not applicable
The base R tools for creating and manipulating factors are messy and frustrating to use.
We’re going to use the {forcats} package which is loaded when we run library(tidyverse).
Almost all of the functions begin with fct_*() to let you know we’re dealing with factors.
Let’s think of the different ways we could order this dataset:
# A tibble: 5 × 2
vore n
<chr> <int>
1 carni 19
2 herbi 32
3 insecti 5
4 omni 20
5 <NA> 7
Count order
In this ordering we will arrange the vore column according to values in the n column.
This is usually what we want in count bar charts.
Canonical order
In this ordering we’ll arrange the vore column from the diet with the most meat to the least meat.
This is usually what we want when visualising survey datasets.
We use fct_reorder() to order a factor by another column.
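Count ordering can be sketched as follows, assuming the tibble above comes from counting `vore` in `msleep`.

```r
library(ggplot2)
library(dplyr)
library(forcats)

vore_counts <- msleep %>%
  count(vore) %>%
  # Reorder the levels of vore by the values of n, smallest first
  mutate(vore = fct_reorder(vore, n))

levels(vore_counts$vore)

# The bars now follow the level order, with the smallest count at the bottom
p_count_order <- ggplot(vore_counts, aes(x = n, y = vore)) +
  geom_col()
```

Note that the NA row is not a factor level, so fct_reorder() leaves it out of the ordering.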
But what about the NA values? What should we do?
We can replace NA values nicely with fct_explicit_na() (in {forcats} 1.0.0 and later this is deprecated in favour of fct_na_value_to_level()).
Let’s come back to moving the position of the NA level.
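A minimal sketch of turning NA into a real level. It assumes {forcats} 1.0.0 or later; the level name "Unknown" is our choice.

```r
library(ggplot2)  # provides the msleep dataset
library(dplyr)
library(forcats)

vore_counts <- msleep %>%
  count(vore) %>%
  mutate(
    vore = factor(vore),
    # Assumption: forcats >= 1.0.0. In older versions the equivalent is
    # fct_explicit_na(vore, na_level = "Unknown")
    vore = fct_na_value_to_level(vore, level = "Unknown")
  )

# "Unknown" is now an ordinary level that can be reordered like any other
levels(vore_counts$vore)
```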
To set our own canonical order we use fct_relevel() and provide a vector with our preferred order.
We can also use fct_relevel() to modify the position of a specific level.
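Both uses of fct_relevel() are sketched here. The "most meat to least meat" order is our reading and is an assumption, particularly the placement of insectivores.

```r
library(ggplot2)  # provides the msleep dataset
library(dplyr)
library(forcats)

# Canonical order: list the levels in the order we want them
vore_canonical <- msleep %>%
  count(vore) %>%
  mutate(vore = fct_relevel(vore, "carni", "insecti", "omni", "herbi"))

levels(vore_canonical$vore)

# Move one specific level: after = Inf sends "omni" to the end
# and leaves the other levels in their existing order
vore_moved <- msleep %>%
  count(vore) %>%
  mutate(vore = fct_relevel(vore, "omni", after = Inf))

levels(vore_moved$vore)
```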
These are the same steps you’ve repeated before
Add a sub-folder to your project called data
Inside of the data folder add a script called obtain-data.R
Add this code
5. Run the code
1. Add a new heading for the GBD Dataset to your .Rmd
2. Filter the dataset as follows:
Most recent year
location_name starts with “World Bank”
metric_name is “Number”
cause_name is “Injuries”
3. Select only these columns
location_name, cause_name, val
Create two versions of this chart:
Bars are ordered by their size
Bars are ordered from “World Bank High Income” to “World Bank Low Income”
Sometimes we want to add additional legend items - usually for NA values, and particularly for maps.
Let’s continue with this chart from before:
We need to choose an aesthetic that works for geom_col() but we’re not using elsewhere in the chart.
This will change depending on your chart. In this instance we can use size.
We now set the na.value colour for the original scale_fill_manual() scale
Next we use the guides() function to override the values for the size legend
msleep %>%
  count(vore) %>%
  ggplot() +
  aes(x = n,
      y = vore,
      fill = ifelse(vore == "herbi", "No meat", "Some meat")) +
  geom_col(aes(size = "Unknown diet")) +
  scale_fill_manual(values = c("Some meat" = "red",
                               "No meat" = "darkgreen"),
                    name = "",
                    na.value = "blue") +
  guides(size = guide_legend(title = "",
                             override.aes = list(fill = "blue")))
If we want to modify the size of legend items we have two choices:
guides(fill = guide_colourbar(barwidth = 0.5, barheight = 10))
… or to set the sizes in the theme().
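The theme() route might look like the sketch below. The `legend.key.width` and `legend.key.height` arguments are real theme() arguments, but treating them as roughly equivalent to `barwidth` and `barheight` is our assumption, and the exact rendered size can differ between {ggplot2} versions.

```r
library(ggplot2)

p_legend <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster() +
  # Size the legend through the theme rather than through guides()
  theme(legend.key.width  = unit(0.5, "lines"),
        legend.key.height = unit(10, "lines"))
```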
There are more than 90 arguments to the theme() function for controlling chart appearance.
Remembering them all is challenging - I usually google them! Or use guides like this one:
Source: https://bookdown.org/alapo/learnr/data-visualisation.html
{ggplot2} has several built-in themes. They have several arguments for quickly customising them.
I’ve been using the default theme, theme_gray(), and its arguments to change the text size in charts.
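For example, the built-in themes accept a `base_size` argument that scales all of the text in one go. A minimal sketch, with the `mpg` dataset and the value 18 chosen by us for illustration:

```r
library(ggplot2)

base_chart <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Same default theme, but with larger text throughout
p_big_text <- base_chart + theme_gray(base_size = 18)

# Other built-in themes take the same argument
p_minimal <- base_chart + theme_minimal(base_size = 18)
```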
The {ggthemes} package contains lots of really useful - and beautiful - themes.
It’s recommended that you choose a theme close to what you want and then customise it.
Most of the theme() arguments expect one of these functions:
element_line()
element_text()
element_rect()
Or element_blank() if you want to remove a theme element.
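A short sketch showing each element function in use. The specific theme arguments and values are our choices for illustration.

```r
library(ggplot2)

p_themed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme(
    axis.title       = element_text(size = 14, face = "bold"),
    panel.grid.major = element_line(colour = "grey80"),
    panel.background = element_rect(fill = "white", colour = NA),
    panel.grid.minor = element_blank()  # removes the minor grid lines
  )
```

Each argument expects the element type that matches what it draws: text for titles and labels, lines for grid lines and axes, rectangles for backgrounds and borders.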